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Abstract 

A parallel algorithm for the implementation of the recursive Green's function 
technique, which is extensively applied in the coherent scattering formalism, is de- 
veloped. The algorithm performs a domain decomposition of the scattering region 
among the processors participating in the computation and calculates the Schur's 
complement block in the form of distributed blocks among the processors. If the 
method is applied recursively, thereby eliminating the processors cyclically, it is 
possible to arrive at a Schur's complement block of small size and compute the 
desired block of the Green's function matrix directly. The numerical complexity 
due to the longitudinal dimension of the scatterer scales linearly with the number 
of processors, though, the computational cost due to the processors' cyclic reduc- 
tion, establishes a bottleneck to achieve efficiency 100%. The proposed algorithm 
is accompanied by a performance analysis for two numerical benchmarks, in which 
the dominant sources of computational load and parallel overhead as well as their 
competitive role in the efficiency of the algorithm will be demonstrated. 
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1 Introduction 



In 1965, Gordon Moore predicted that the number of transistors packed on a 
chip would continue to double every year, a prediction known as the Moore's 
law [1]. During the past few decades, the rapid progress of novel experimen- 
tal techniques, has resulted in scaling down the size of the integrated circuits 
on a chip according to Moore's law. Nowadays, chips with hundreds of mil- 
lions of transistors have been industrially realised and are exploited in a wide 
range of commercial applications. However, the continuous scaling down in 
size of the transistors is about to reduce their dimensions thereby entering the 
mesoscopic regime, in which the electronic wave length becomes important, 
i.e., comparable with the size of the device, and quantum effects dominate 
and define the laws of information processing [2]. The natural route towards 
future electronics is therefore to understand these effects and comprehend 
them in the design of the nanoscaled devices. The necessary condition for the 
description of these phenomena in realistic mesoscopic transistors, regarding 
computational resources, is the ability to treat systems with million degrees 
of freedom. 

The theoretical framework for the description of mesoscopic electronic trans- 
port has been established within the Landauer formalism [3], in which the 
conductance of a mesoscopic sample is in direct relation to the probability 
that an electron will transmit through it. To this end, several numerical tech- 
niques have been developed and applied to describe various physical setups. 
The most efficient method to attack coherent ballistic transport has proven 
to be the recursive Green's function (RGF) approach. The general framework 
for this approach can be found in Refs. [4,5,6] and depending on the emphasis 
of the individual scattering problem, alternative numerical techniques can be 
applied. Therefore, techniques such as the boundary element method [7] , with 
an emphasis on the arbitrary geometry of the scattering region, or the mod- 
ular Green's function method [8], in which the scattering region is initially 
decomposed in modules which are finally joined via the Dyson equation, have 
been developed to take into account the particular geometrical features of the 
scattering problem. Recently, a RGF technique has been apphed to describe 
scanning probe experiments [9]. This technique describes tunneling, through 
the STM tip, which comprises the whole scattering area but scales equally well 
with the standard RGF method. As an alternative solution to improve the 
efficiency and consequently the capability to treat larger systems, approxima- 
tions in the Schrodinger eigenvalue problem, as in the contact block reduction 
method [10], have been employed to treat multi-terminal three-dimensional 
problems with relatively good accuracy. 

The aim of this paper is to present a parallel algorithm for the computation 
of the electronic transmission probability, within the framework of the RGF 
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method. The parallehzation will allow us to treat large systems with millions of 
degrees of freedom and will be particularly efficient to handle highly complex 
modular scattering structures. The paper is organised as follows. In section 
2 we discuss the basic guidelines of the coherent scattering formalism and 
the desired computational goal to be achieved. In section 3 we construct the 
parallel algorithm and calculate its numerical complexity. Section 4 contains an 
analysis of the performance and scalability of the applied parallel algorithm for 
certain numerical benchmarks. Finally, section 5 draws our main conclusions. 



2 Basic guidelines of the coherent scattering formalism 



Coherent scattering formalism implies that the conductance of a mesoscopic 
sample attached to two reservoirs (Fig. 1) is proportional to the quantum- 
mechanical probability T{E) that an incoming electron at a Fermi energy 
E in the reservoirs will transmit through it. To evaluate the transmission 
probability T{E) one has to solve the Schrodinger equation: 

{E-H{r) + ir])G^{r;r') = 6{r-r') (1) 



where H{r) is the Hamiltonian and G^{r; r') is the retarded Green's function 
operators of the open system (scatterer + reservoirs). In the following we 
restrict ourselves to two-dimensional {2D) transport. 




X 



Fig. 1. Two-dimensional scattering region attached to two reservoirs discretized on 
a uniform lattice with constant a. 

To proceed with the calculation of T{E) we discretize our space on a uniform 
lattice with constant a. In order to represent the Hamiltonian operator H{r) 
we use the tight-binding model assuming only nearest neighbor interactions 
[4]. In this case the Hamiltonian can be written: 

H(r) = J2 |r)er(r| + E |r)K-,Ar(r + Ar| (2) 

r r,Ar 
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where er is the on-site energy at the position r = {x, y) with x — na and 
y = ma, n, m G N, Ar represents the vectors from r to their nearest neighbor 
sites and K-,Ar is the nearest neighbor hopping energy. The dispersion relation 
for the 2D discretized lattice reads for a constant on-site energy: 

E2d{^) = 4^0 - 2VoCos{k^a) - 2VoCos{kya) (3) 



where k = {kx,ky) is the electron's wavevector and Vq — — /i^/(2m*a^) is 
the matrix hopping element linking each site to its nearest neighbor. In the 
limit a — > we recover the usual parabolic relationship of a free particle in a 
continuum space. 

The full tight-binding Hamiltonian of the open system (scatterer -|- reservoirs) 
can be then decomposed in the following block form: 



H(r) = 



Vl Hs Vr 



where the Hamiltonian Hg describes the electronic motion in an arbitrary 
scattering region which is coupled to two external reservoirs from the left and 
right, via the semi- infinite matrices Vl and Vr, respectively. The Hamiltonian 
operators Hl and Hr are of infinite size and describe the electronic flow within 
the reservoirs. 

Following Datta [2] one can accordingly partition the overall retarded Green's 
function operator of equation (1). It is then possible to obtain for the retarded 
Green's function operator of the scatterer the following expression, 

G{E) = [EI - Hs - ^n{E) - ^i.{E)]-' (4) 



which takes into account the effect of the couphng to the reservoirs, via the so 
called self-energy matrices Y^j^iE) = Vk;Gk(£^)Vk due to the left (K = L) 
and right (K = R) reservoir. The function Gk is the retarded Green's function 
operator of the reservoir K, i.e., Gk(-E) = [{E + irj)! — Hk]~^. 

Due to the tight-binding's model discretization, the space of the scattering 
region now consists of n = 1,2...,N slices along the x-direction each of 
which consists of m = 1,2, . . . , M sites along the y-direction. The matrix 
A = El — Hs — Ti-r{E) — T,-L,{E) we want to invert in order to evaluate G{E) 
is a, N xN block tridiagonal matrix [4] whose elements are the blocks Ay each 
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of which is of size M x M: 
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The expression for the evaluation of T[E) can be given in a compact form 
within the Fisher-Lee relation [11]: 



T{E) = Tr[Tn.{E)G{E)Ti^{E)G\E)] 



(5) 



where Ti^[E) = i[Y!,j^{E) — is the strength of the coupling of the 

reservoir K to the scatterer. Due to the fact that the reservoirs are coupled 
only to the left and right of the scatterer, the blocks that correspond to the 
left interface of the scatterer with the lead, i.e. the upper left block (Tl(-E') 
of Tii^lE), and to the right interface of the scatterer with the lead, i.e. the 
down right block crii[E) of Sr(-E), are the nonzero blocks of the matrices 
Sk(-E). Therefore, the total self-energy due to the right and left reservoir has 
the following structure: 
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Due to the above mentioned structure of the self-energy matrices, it becomes 
clear that only the upper left block of T-l,{E), 7l(-E) = i{ai,{E) — al^{E)) 
and the down right block of TniE), 7r(S) = i{an{E) - al^{E)) are nonzero. 
Hence, the trace of the product of the four matrices occuring in equation (5) 
simplifies to: 



TiE) = Tr[jniE)GrMEhL{E)Gl^{E)] 



(6) 
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Equation (6) implies that only the upper right block of the inverse of A, 
^1 N = Gi N is necessary for the evaluation of T{E). The ultimate goal is 
therefore to compute A^^. 



3 The parallel algorithm 

3. 1 Prerequisites 

The overall scattering problem, as discussed in section 2, is summarized to a 
N X N block tridiagonal matrix A = £"1 — H — 'E-r{E) — Sl(-E') of which each 
block is of size M x M, where the goal is to compute the upper right block of 
the inverse of A, A^j^f. 

The algorithm that we pursue should possess the following properties: 

(1) Storage requirements should be restricted to a small number of blocks of 
size M X M. 

(2) The number of inversions and multiplications of the M x M blocks Ay, 
which scale as M^, should be proportional to A^. This corresponds to the 
numerical complexity of the sequential RGF technique in the asymptotic 
limit of large N and M: 

(3) Exploit the fact that the matrix A is Hermitian, i.e., for the off-diagonal 
blocks is claimed that A;j = Aji. 

(4) The algorithm should be parallehzable. 

3.2 Preparations 

3.2.1 Change of the inverse under permutation 

Let Pij be an elementary permutation matrix with the following properties: 

(1) Set A = Pij A, then A is identical to A except that rows i and j are 

interchanged. 

(2) Set A = APy, then A is identical to A except that columns i and j are 
interchanged. 

(3) P? = Pij = Pji. 

(4) Pij • Py = I, i.e., Pij is orthogonal and self-inverse Pij = Pij^. 
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We call P = Pi„j„ . . . Piiji a permutation matrix. Then P -"^ = (Pi„jn • • • • • Pii ji) — 
■^iiJi ■ • • • ■ PiVjn ~ -PiiJi ■ • • • ■ Pinjn = ^ow if We apply row and column 
permutations to the matrix A, A = PAP'^ then for the inverse we have that 
A-i = (PAPT)-I = p-^A-ip-i = PA-ipT. 

The above imply the following two alternative paths for the computation of 

-^l,N- 

(a) Starting from A we compiitc the inverse of it. Then by applying the appro- 
priate row and column permutations, through operation of the permutation 
matrices, it is possible to shift the desired upper right block A^j;^ another 
position of the inverse. Respectively, the down left block of A is also shifted. 
This first path can be illustrated graphically as follows: 




compute 
inverse of A 




apply row/col 
permutations 



PA-1] 









(b) Alternatively, if we start by applying row and column permutations in 
the initial matrix A, then we can shift the upper right block Ai n into an- 
other position. If we compute the inverse of the new matrix then the desired 
block A^jsf will be located at the same position. Graphically, this second path 
implies: 




apply row/col 
permutations 




compute 
inverse of A 




Therefore, the diagram implies that computation of the desired block of the 
inverse matrix by following path (a) is equivalent to the computation of 
A^N by following path (b) . 
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3.2.2 Expression of the inverse via the Schur complement 



Let any matrix A with a general 2x2 block structure: 



All A 



A 



12 




Then the inverse of A in block form is: 



( 



All + Aii^AiaS-^AaiA 
-S-iA2iAri' 
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-ArMiaS 

s-1 



where S = A22 — A21A11 A12 is the so called Schur's complement block. 

Hence, together with the permutation Lemma (section 3.2.1) we arrive at the 
following statement: 

// the block Ai^n is transfered to the block A22 via permutation transformation 
then the desired block A^^ of the inverse can be obtained from the inverse S""*" 



3.3 Parallel recursive algorithm 

To construct the parallel recursive algorithm for the computation of A-^ we 
proceed as follows. By starting from the matrix A in its original block tridi- 
agonal form, we induce an additional block structure thereby distributing the 
domains of the scattering region to p processors as shown in Figure 2. This 
secondary level block structure, due to the scatterer's domain decomposition, 
consists of p blocks, which in turn contain ni,n2, ■ ■ ■ ,np blocks respectively 
and p + 1 elementary blocks which correspond to the interface slices of the 
decomposed domains. Additionally, we encounter blocks that couple the in- 
terface slices with the p blocks corresponding to the scatterer's domains. The 
position of the upper right block A^^ that is required to be computed is 
indicated in Figure 2. 

In the next step we reorder rows and columns, formally through permutation 
matrices, and we arrive at the reordered matrix with the structure of Figure 
3. The reordered matrix has the 2x2 block structure. 



ofS. 
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Fig. 2. Original block tridiagonal matrix with new secondary level block structure 
due to processor subdivision. 



and moreover, the desired block to be computed is transfered to the upper right 
corner of A^^. Therefore, in order to compute A^^, it suffices to compute 
S = A^^ - A^^(A")"^A^^ and extract the upper right block of S~^. The 
computation of S results again in a block tridiagonal matrix and the algorithm 
can be applied recursively, i.e., by knowing S and applying cyclic reduction 
among the processors which participate in S, we can arrive recursively to a 
matrix that is small enough to compute A^]^ directly. 

Analytically, the stages to which the parallel RGF algorithm is divided as well 
as the corresponding numerical complexities are the following: 

(1) First Stage: Scatterer's domain decomposition and computation of S 
The scatterer is decomposed in domains with ni,n2, ■ ■ ■ ,nk, ■ ■ ■ ,np blocks. 
Each domain corresponds to one of the altogether p processors participat- 
ing in the computation and additionally, there are p+1 elementary blocks 
of each interface slice between the domains (Fig. 2). At this point we have 
to note that the last processor stores two interface blocks, i.e., App and 
Ap^-^p_,_j^. Then we reorder rows and columns such that the matrix A 
has the block structure of Fig. 3. Subsequently, the algorithm performs a 
block Gaussian elimination adapted to the special sparse block structure 
of Fig. 3, i.e., it proceeds by eliminating A^^ using A^^. Analytically, the 
steps of the block Gaussian elimination applied hereby: 
V processor k 
for i = 1 . . .Hk 
B = (ALI)h' 
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Fig. 3. Reordered matrix A after row and column permutations. 
(Akk)i+i,i+i = (Akk)ii - (Akk)[i+iB(A^^j,)i,i+i 

(^kk)i+l = ~(Akk)i,i+lB(A,^j^)i 

if i == rik + 1 

B = (Akk)nk,nk 

^kk ~ l^kkJnk l^kkJnk-'-'v'^kk/'nk 

Arr _ _/ Airu jiA^r 

^k,k+l — l^kkJnk^"^k,k+l 

The numerical cost for eacli processor scales witti Up inversions of M x M 
blocks and requires 6 ■ Up multiplications of matrices (see the algorithm 
above). With respect to the storage only a few auxiliary blocks of size 
M X M, independent of n^, are required. Hence, each processor at the 
end of the first stage of the computation has stored the diagonal A^^ and 
off-diagonal A^^_,_-^ block of the Schur complement. At this point we note 
that the notation used in the subscript of the blocks of S is identical to the 
one of the blocks of A^'" for convenience. The last processor computes, in 
addition to the two previously mentioned blocks, the last block Ap^^ p_^-^. 
The numerical complexity for each processor scales, in the limit of large 
N and M, with: 

Ci ^ 7nkM^ ^ 

P 

After the completion of the first stage, the Schur's complement block S 
has been computed with the form of distributed blocks A^^ and A^^_,_j^ 
among the processors participating in the computation. Moreover, S has 
a block tridiagonal structure and is Hermitian: 
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(2) Second Stage: Cyclic reduction of the processors participating in the Schur's 
complement block 

To proceed further, we exploit the block tridiagonal structure of S. To 
this end we apply a recursive technique called cyclic reduction [12]. The 
implementation of this technique requires successive reordering of the 
processors in such a way that in each step the Schur's complement block 
is half the size as before. The first step of the cyclic reduction algorithm 
is shown in Fig. 4. 
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Fig. 4. Reordering according to the cyclic reduction algorithm for a Schur's com- 
plement block of size (p + l) x (p+l). The size of S after the applied block Gaussian 
elimination is reduced to half of the preceding size. 
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Wc observe that the reordered block structure possesses again the 2x2 
structure of the matrix A. Therefore by ehminating the off-diagonal block 
using the upper-diagonal, i.e., the procedure of the first stage, we arrive 
at a new Schur's complement block of half the size as the preceding one. 
By applying this procedure recursively, after log2{p) steps we arrive at a 
3x3 block matrix, of which the upper-right diagonal block of the inverse 
is the desired Aj^^ one. At this point we should remark that in each re- 
cursive step, the first and the last processor should always participate in 
the new resulting Schur's complement block, as shown in Fig. 4. This con- 
dition ensures that the desired block A^\^ is always located in the upper 
right corner of S. In this second stage of parallelization, each recursive 
step requires one inversion and four multiplications for the calculation of 
the diagonal A^^ the fill-in A^-^_,_j^ blocks of the resulting Schur's 
complement block (see algorithm of the first stage applied to the block 
structure of Fig. 4). The numerical complexity of the second stage scales 
as: 

C2 ^ 5log2{p)M^ 

After log2{p) recursive steps, we are left with a 3 x 3 block matrix of 
which the first row, i.e., blocks A^^ and A^2) stored in the first 
processor and the rest two rows, i.e., blocks App, App_,_-^ (second row) 
and Ap^-^p_,_j^ (third row), are stored in the last processor. The upper 
right block of the inverse of this 3x3 block matrix is the desired Aj^j;^ 
which can be straightforwardly computed. 
(3) Third Stage: Computation of the transmission coefficient 

At the last stage, there remain a few multiplications c of the blocks that 
are included inside the Fisher-Lee relation and are all known for the 
evaluation of T{E). These operations are performed sequentially from 
the first processor. The numerical complexity for this last stage can be 
evaluated as, 

C3 fti cM^ 

and since c is a small constant, in the limit of large N, C3 can be absorbed 
in Ci. 

The numerical complexity of the parallel algorithm scales as: 

N 

Cpar(A^, M, p)^Ci + C2 + Cs^ + 5log2pM^ (7) 



and the corresponding sequential {p — 1) one, as: 

Cseq(iV,M) ^ 7NM^ 
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We should remark that the algorithm developed here holds equally for scat- 
tering regions with complex boundary conditions, i.e., blocks Ay with varying 
sizes, and can be generalized to the geometry of 3D scatterers in a straight- 
forward manner. 



4 Numerical benchmcirks 

4.1 Metrics for the analysis of performance and scalability 

In this section an analysis of the performance and scalability for two specific 
numerical benchmarks will be pursued. This is required in order to test the 
models for the numerical complexity we derived so far and to demonstrate 
a measure for the capabilities and optimized use of the proposed algorithm. 
To proceed with such an analysis it is necessary to define some characteristic 
quantities for our parallel algorithm following Ref. [13]. Firstly, we define the 
problem size: 

W{N, M) = INM^ 

which is the number of numerical operations in the sequential algorithm {p — 
1), i.e., the RGF approach, and is also equal to the serial run time Tg if a unit 
of time corresponds to each numerical operation. The cost of simulating the 
parallel algorithm on a single processor is: 

pTp{N, M,p) = pCp^{N, M,p) = 7NM^ + 5plog2ip)M' 

where Tp is the parallel run time corresponding to Cpar(-/V, M, P) if we assume 
a unit of time for each computational step. The overhead function Tq of the 
parallel algorithm is defined as: 

To(M,p) =pTp-W = hplog2{p)M^ 

and determines the part of its cost that is collectively spent by all processors 
compared to the sequential algorithm. The sources of overhead of a parallel 
system can be in general attributed to interprocessor communication, load 
imbalance and extra computational time due to a part of the program that is 
not parallelizable. In our algorithm the dominant contribution to the overhead 
results from the amount of operations during the cyclic elimination of the 
processors. The extra computational time required for the evaluation of the 
Fisher- Lee relation (this is the only not parallelizable part) can be neglected in 
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the limit of large A^. As far as load imbalance is concerned, the two numerical 
benchmarks to be investigated will show a different significance of this source 
of overhead. Finally, we define the efficiency of the parallel algorithm as: 



W 7NM^ 



1 




pTp p (j^M^ + 5 log2{p)M^) 1 + 



bplog2(p) 
7N 



Prom this relation, we conclude that the efficiency is independent of the size of 
blocks M and depends only on the longitudinal length of the scatterer N and 
the number of processors p participating in the computation. Moreover, by 
scaling appropriately N with p, it is possible to maintain the efficiency fixed, 
a property met in scalable parallel algorithms. From Eq. (8) we can define the 
isoefficiency function: 



where K — F/(l — F) is given for a specific F. For fixed K we can arrive at 
the following relation for N and K: 



Therefore, our algorithm can be cost-optimal if we choose = ^Kplog2{p) 
and scalable if we increase N with rate 0{plog2{p))- On the other hand, for a 
fixed size problem, i.e., keeping N and M fixed, we observe that the efficiency 
decreases with increasing p as a consequence of Amdahl's law (see Eq. (8)). 
Here some final remarks are in order. In the quantities defined so far, we have 
assumed lattices of unique size N x M ior the discretization of the scattering 
regions (perfectly load balanced problems). In addition, the time spent for 
the interprocessor communications is neglected. This is due to the increased 
granularity of the block tridiagonal system, resulting in a better efficiency of 
the parallel algorithm. 

4-2 Sinai billiard in a magnetic field 

The first numerical benchmark to test the performance of our algorithm is 
the Sinai billiard in a homogeneous magnetic field. Figure 5 shows the setup. 
Modified Sinai billiard provide a class of systems for testing the correspondence 
between quantum and classical transport. Its simple geometry allows the use 
of lattices with a unique size for the discretization, leading therefore to a 
perfectly balanced problem with respect to the numerical work loaded to each 



W = KTr 







N ^ - 



Kplog2ip) 



(9) 
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processor. It represents therefore an excellent ground to test the models for 
the complexity we developed in subsection 4.1. 



y = mai' 



Fig. 5. Setup of a Sinai billiard attached to two reservoirs with n = 0, 1,...,A^ — 1 
slices of m = 0, 1, . . . , M — 1 sites each, used in the fixed-size efficiency calculations. 
The ratio of the two dimensions is ^ = |. 

The first setup to test the performance of our algorithm uses a 400 x 250 
lattice for the discretization of the Sinai billiard (ten times resolved compared 
to the one of Figure 5). The first type of analysis consists of keeping the lattice 
fixed and studying the efficiency of the problem with increasing the number of 
processors. We remind the reader that the total cost of the parallel algorithm 
is dominated by the cost for the evaluation of the Schur's complement block 
and the cost due to the cyclic reduction of the processors (see Eq. (7)). Table 
1 shows the times measured for the evaluation of T[E, B) at a fixed energy E 
and magnetic field B. 

Table 1 

Measured time (Time) and efficiency {F) as a function of the number p of the 
processors for a Sinai billiard in a magnetic field with fixed size = 400 and 
M = 250. 
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0.906 


40 


59.11 


0.729 


128 


34.27 


0.393 



At this point we note that the system used for the time measurements has been 
a Linux cluster of 256 nodes with Dual AMD Athlon 1.4 GHz processors of 2 
GB RAM each. Efficiency is 1.0 for p = 1 and gradually decreases with p. This 
is due to the fact that with increasing p, the term in equation (7) proportional 
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to log2{p), dominates with respect to the other term that decreases with — , 
thereby decreasing the efficiency of the proposed algorithm. 

Figure 6 shows the efficiency F as a function of the number p of proces- 
sors according to the performed time measurements (dots) compared to the 
analytical curve of Eq. (8). We observe that the agreement between the theo- 
retical model and the measurements is very good. Therefore, we conclude that 
the dominant sources of numerical load have been succesfuUy identified and 
weighted. Further sources of overhead, such as the time required for interpro- 
cessors' communication, could be neglected as the work load is dominated by 
the amount of numerical operations that scale with M^, i.e., multiplications 
and inversions of M x M blocks. 
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Fig. 6. Efficiency as a function of the number p of processors. The dots correspond 
to the measured efficiency and the solid curve to the theoretical model employed. 

The next step in our analysis is to perform a scaling size experiment. The 
aim of this test, is to scale the size of the problem such that the efficiency is 
kept fixed. As we saw from Eq. (8) the efficiency is independent of the size of 
the transversal dimension M and depends only on the size of the longitudinal 
dimension and the number of processors p. Therefore, by scaling appropri- 
ately N with p it is possible to arrive at a fixed efficiency F of the algorithm. 
According to equation (9) for p = 2 processors the efficiency can be 0.848 if 
we choose A^ = 8. If we keep increasing the number of processors p and the 
size of the system A", keeping M fixed, according to the relation: 

^/ ^ j^ P'io92{p') 
plog2{p) 

where A^' and p' arc the new size of the system and the new number of pro- 
cessors respectively, then we expect that the efficiency will stabilize around 
84.8%. Table 2 shows the efficiency for the scaled size problem. 

We observe that the efficiency is stabilized between 0.81 and 0.85 thereby 
confirming our prediction. The sources of these slight deviations could be at- 
tributed to some enhanced contributions of time spent in interprocessor com- 
munications. Therefore our models provide a reliable source for the estimation 
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Table 2 

Efficiency (F) as we increase the longitudinal dimension A'^ of the billiard with the 
number of the processors p according to N = 0{plog2{p))- We keep M = 100 fixed. 



N- n 






7? 

± 


8- 9 
0, Z 


1 1 
J. . J. 


u.uo 


U.O-LU 


32; 4 


4.76 


1.44 


0.826 


96; 8 


14.27 


2.17 


0.822 


256; 16 


38.32 


2.82 


0.849 


640; 32 


95.25 


3.54 


0.841 


1536; 64 


228.79 


4.27 


0.837 


3584; 128 


534.06 


5.04 


0.828 


8192; 256 


1222.07 


5.82 


0.82 



of the computational cost. Table 2 shows that the larger the size of the sys- 
tem N, the larger becomes the efficiency. Therefore, our parallel algorithm 
is suitable for large systems, in particular of enhanced longitudinal dimen- 
sion. Scattering problems with complex structures could be disentangled into 
modules with arbitrary complexity, of which the computation could be done 
efficiently by one processor. Cyclic reduction among the processors would join 
the information of the individual modules. If the computational complexity 
of a module is particularly enhanced for one processor, then more processors 
could be employed. 



4-3 Antidot inside the scatterer 

The second numerical benchmark corresponds to a category of scatterers with 
enhanced complexity. It consists of a Sinai billiard with a centered antidot of 
circular shape. This setup has been chosen for simulations in Ref. [14]. The 
numerical challenge imposed hereby is the exact reproduction of the antidot 's 
circular shape in the continuum limit. 

Figure 7 shows the discussed geometry. Subfigure 7- (a) shows the open ge- 
ometry and dimensions of the Sinai billiard, while in 7-(b) the isolated Sinai 
billiard is discretized on a 49 x 49 grid of points. On such a small grid the 
antidot has, on the scale of Fig. 7-(b), the shape appearance of a polygon. 
Subfigure 7-(c) shows the same setup of the Sinai billiard but on a grid which 
is four times resolved compared to 7-(b), i.e., a 399 x 399 grid. The latter is 
going to be our fixed input size for the time measurements as we increase p. At 
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Fig. 7. (a) Open scattering geometry of a Sinai billiard with a centered antidot of 
circular shape. Subfigure (b) shows the isolated scatterer on a 49 x 49 grid of points 
and width W = lOo. Subfigure (c) shows the same setup but four times resolved. 
The thickness of the border lines in (b) and (c) provide a measure of the lattice 
constant. 

this point we remark that the antidot has hard wall boundaries, i.e., the sites 
which form the antidot are excluded from the computation, thereby leading 
to blocks Aij with varying dimensions. Table 3 shows the efficiency measured 
for the evaluation of T{E) at a fixed energy E as a function of p. 

Table 3 

Measured time (Time) and efficiency (F) as a function of the number p of the 
processors for a Sinai billiard with an antidot placed centrally in it. The lattice 
N = 399 and M = 399 is kept fixed. 



p 


Time (sec) 


_ F _ 


P 


Time (sec) 


_ F _ 


P 


Time (sec) 


_ F _ 


1 


13490.83 


1.0 


14 


1201.49 


0.802 


48 


417.8 


0.673 


2 


6791.23 


0.993 


16 


1058.31 


0.797 


56 


379.9 


0.634 


4 


3917.2 


0.861 


20 


855.45 


0.789 


64 


343.87 


0.613 


6 


2689.56 


0.836 


24 


734.14 


0.766 


80 


271.07 


0.622 


8 


1974.65 


0.854 


28 


655.5 


0.735 


96 


267.04 


0.526 


10 


1649.51 


0.818 


32 


571.54 


0.738 


112 


226.92 


0.531 


12 


1404.99 


0.800 


40 


462.83 


0.729 


128 


224.37 


0.47 



The efficiency decreases with increasing p as expected. We should note that 
for these measurements equidistant domains, with respect to the longitudi- 
nal dimension, have been distributed among the processors. However, due to 
the antidot 's boundaries, it becomes clear that this kind of distribution leads 
to an inevitable load imbalance. The domains that include sections of the 
antidot are described by blocks of smaller size, resulting thereby in reduced 
computational load for the corresponding processors. For p = 2, we observe an 
efficiency very close to 100%. This is a result of the symmetry of the geometry 
of the setup, which results in a load balanced problem for this specific num- 
ber of processors. If we further increase p then the efficiency falls abruptly. 
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This result is attributed to the intensive load imbalance for few number of 
processors. To remedy this problem we have to choose a non-uniform domain 
decomposition of the scattering region, leading, thereby, to a more fair work 
load for all processors. For a larger number p, however, this problem becomes 
much less intense, since the total cost is multiply distributed in fairly small 
pieces of numerical load and the inequality among the processors, with respect 
to the load they share, significantly reduces. Therefore, for rather large p, load 
imbalance is not a significant source of parallel overhead, however, deviations 
compared to a load balanced setup are still evident (see below) . 

To analytically calculate the efficiency of the parallel algorithm for the setup 
in discussion, it is necessary to take into account the circular shape of the 
antidot. For this purpose, we divide the scatterer in two sections. One section 
of which the numerical cost scales with A^i x arithmetic operations, where 

A^i the number of slices outside the antidot, and a second one of which its 

N2 

computational load scales with Mf where Mj is the varying size of the 

blocks of each of the N2 shces that compose the antidot. Therefore, the size 
of the scattering problem is: 

N2 

W{N, M) = INiM^ + 7Y,M^ 



Moreover, we assume that at the first stage of parallelization, the work W is 
distributed uniformly among the processors and that at the second stage the 
processors that participate in the cyclic reduction arc weighted appropriately, 

with respect to the load that corresponds to them. This is translated to the 

N2 

fact that |p processors possess a work load that scales with J2 and |p 

processors possess a work load that scales with M^. Therefore, the cost for 
the parallel algorithm will be: 

N2 o 2« ^2 

pTp = 7N,M' + 7 ^ + 3plog2{^)M^ + 2plog2{-^) ^ Mf 
i=i ^ ^ i=i 



The efficiency, which is no longer independent of the size of the transversal 
dimension M, will be: 

N2 

F^ — = (10) 

r^P N2 N2 ^ ' 

N,M^ + E + fpZo^2(f )M3 + fpZo^2(f ) E Mf 

i=l 1=1 



Figure 8 shows the measured efficiency (dots) as a function of p. We observe 
a rather abrupt decrease of F for a small number of processors p > 10 which 
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smoothens for larger p. The solid curve of Figure 8 represents the analytical 
model of Eq. (10), calculated for the 399 x 399 grid of subfigure (7)-c. 
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Fig. 8. Efficiency F as a funtion of the number p of processors. The dots correspond 
to the measured efficiency and the solid curve to the theoretical model derived to 
take into account the special geometry of the setup. 

The agreement with the measurements is quite well, however, deviations for 
p > 2 are evident. For p = 2 the prediction agrees due to the symmetric load 
share to the two processors for this problem. For p > 2, deviations are apparent 
due to the assumptions within the derivation of our model. Namely, neither 
does W distribute itself evenly among the processors (load imbalance) nor is 
the computational load due to the cyclic reduction weighted exactly among 
the processors, as we assumed. To remove the first assumption one should 
proceed to an uneven domain decomposition with respect to the processors, 
which would vary depending on p. We conclude thereby, that in a scattering 
problem of complex geometry, the strategy to be followed in order to optimize 
the efficiency of the algorithm, regarding the load that the processors share, 
should take into account the particular geometric features of the scatterer. 



5 Conclusions 

A parallel algorithm for the implementation of the RGF method has been 
developed. The structure of the algorithm is mainly based on an initial do- 
main decomposition of the scattering region due to processors' subdivision 
and recursive computation of the Schur's complement block through cyclic 
elimination of the processors. The computational cost due to the longitudinal 
dimension of the scattering region scales linearly with p. However, the cost due 
to the cyclic elimination, prevents us from achieving an efficiency of 100%. To 
demonstrate the efficiency of the parallel RGF algorithm, we proceeded with 
an analysis of the performance, scalability and sources of overhead for two 
specific numerical benchmarks. The first numerical benchmark corresponds to 
a perfectly load balanced setup, such as a Sinai billiard in a magnetic field, 
and the derived model is in very good agreement with the measurements. The 
second numerical example contained an additional geometrical challenge, be- 
ing the exact reproduction of the circular shape of an antidot with hard wall 
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boundaries in the centre of a Sinai billiard. The computation hereby required 
manipulation of blocks with varying sizes leading to a nonuniform numerical 
load for the processors participating in the computation. A model adapted to 
the special geometry of this problem has been employed, which exhibited its 
geometric peculiarities and indicated the additional source of overhead due 
to load imbalance. The effect of the latter can be reduced by a selection of 
non-uniform decomposed domains distributed to the processors, based on the 
numerical cost. From our analysis it became apparent that the parallel RGF 
technique developed here, is particularly suitable for modular scattering struc- 
tures of high complexity. Parallelization in this context gives the freedom to 
decompose the scatterer into modules, the computation of each can be effi- 
ciently performed by one processor. The optimized distribution of modules to 
processors depends on their individual complexity. In case, their complexity is 
enhanced, more than one processors could be employed and the corresponding 
computational load should be shared according to the individual features of 
the module. 

P. S. D. and P. S. gratefully acknowledge illuminating discussions with G. Fa- 
gas. P. S. D. also acknowledges financial support from DFG in the framework of 
the International GraduiertenkoUeg IGK 710 "Complex processes: Modehng, 
Simulation and Optimization" . 
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